Abstract

We  focus  on  the  problem  of  finding  patterns  across  two  large,  multidimensional  datasets.  For  example,  given 
feature  vectors  of  healthy  and  of  non-healthy  patients,  we  want  to  answer  the  following  questions:  Are  the 
two  clouds  of  points  separable?  What  is  the  smallest/largest  pair-wise  distance  across  the  two  datasets? 
Which  of  the  two  clouds  does  a  new  point  (feature  vector)  come  from? 

We propose a new tool, the tri-plot, and its generalization, the pq-plot, which help us answer the above
questions. We provide a set of rules on how to interpret a tri-plot, and we apply these rules to synthetic and
real datasets. We also show how to use our tool for classification when traditional methods (nearest neighbor,
classification trees) may fail.
1 Introduction and motivation


The automatic discovery of meaningful patterns and relationships hidden in vast repositories of raw
information has become an issue of great importance. Multimedia systems for satellite images, medical data
and banking information are some examples of prolific data sources. Many of these data are inherently
multi-dimensional. It is often difficult to summarize a large number of attributes by extracting a few essential
features. Moreover, many methods proposed in the literature suffer from the dimensionality curse and are
impractical to apply directly. Thus, dealing efficiently with high-dimensional data is a challenge for researchers
in the database field [WSB98, BBK98]. Things become worse when more than one dataset is involved.

We propose a method for exploring the relationship between two multidimensional datasets, by
summarizing the information about their relative position. Our method requires only a single pass on the data and
scales linearly with the number of dimensions.

Problem definition. Given two large multidimensional datasets, find rules about their relative placement in
space:

Q1 Do the datasets come from the same distribution?

Q2 Do they repel each other?

Q3 Are they close or far away?

Q4 Are they separable?

Q5 For a given, unlabelled point, which of the two sets does it come from (if any)?

In the following section, we will briefly discuss the related work on data mining techniques and describe
the datasets we used in our experiments. We then introduce the cross-cloud plots and explain their properties.
Based on these, we present a set of practical rules which allow us to analyze two clouds of points. Finally,
we describe the algorithm for generating the plots.

2  Related  work 

There has been a tremendous amount of work on data mining during the past years. Many techniques have
been developed that have allowed the discovery of various trends, relations and characteristics with large
amounts of data [JAG99, Cha98]. Detailed surveys can be found in [CHY96] and [GGR99]. Also, [Fay98]
contains an insightful discussion of the overall process of knowledge discovery in databases (KDD) as well
as a comprehensive overview of methods, problems, and their inherent characteristics.

In the field of spatial data mining [EKS99] much recent work has focused on clustering and the
discovery of local trends and characterizations. Scalable algorithms for extracting clusters from large collections
of spatial data are presented in [NH94] and [KN96]. The authors also combine this with the extraction of
characteristics based on non-spatial attributes by using both spatial dominant and non-spatial dominant
approaches (depending on whether the cluster discovery is performed initially or on subsets derived using
non-spatial attributes). A general framework for discovering trends and characterizations among neighborhoods
of data-points is presented in [EFKS98]. This framework is built on top of a spatial DBMS and utilizes
neighborhood-relationship graphs which are traversed to perform a number of operations. Additionally,
scalable clustering algorithms are included [AGGR98, TZ96, SCZ98, ERB98].

The work on fractals and box-counting plots is related: [BF95] used the correlation fractal dimension
of a dataset to estimate the selectivity of nearest-neighbor queries; [FSJT00] gave formulas for the
selectivity of spatial joins across two point-sets. [BBKK97] analyze the performance of nearest-neighbor queries,
eventually using the fractal dimension. More remote work on fractals includes [PKE00], [JTWE00], [BC00].
Dataset          Description

Synthetic datasets
Line             Points along a line segment, randomly chosen.
Circumference    Points along a circle, randomly chosen.
Sierpinsky       Randomly generated points from a Sierpinsky triangle (see fig. 7b).
Square           Points on a 2D manifold, randomly generated.
Cube             Points in a 3D manifold, randomly generated.
Super-cluster    256 uniformly distributed clusters, each with 7x7 points in a 2D manifold.

Real datasets
California       Four two-dimensional sets of points (obtained from UCI) that refer to geographical
                 coordinates in California [oC89]. Each set corresponds to a feature: 'streets' (62,933
                 points), 'railways' (31,059 points), 'political' borders (46,850 points), and natural
                 'water' systems (72,066 points).
Iris             Three sets describing properties of the flower species of genus Iris. The points are
                 4-dimensional (sepal length, sepal width, petal length, petal width); the species are
                 'virginica', 'versicolor' and 'setosa' (50 points each). This is a well-known dataset in
                 the machine learning literature.
Galaxy           Datasets from the SLOAN telescope: (x, y) coordinates, plus the class label. There are
                 82,277 in the 'dev' class (deVaucouleurs), and 70,405 in the 'exp' class (exponential).
LC               Customer data from a large corporation (confidential). There were 20,000 records
                 (belonging to two classes with 1,716 and 18,284 members each), each with 19
                 numerical/boolean attributes.
Votes            Two 16-dimensional datasets from the 1984 United States Congressional Voting
                 Records Database: 'democrat' (267 entries) and 'republican' (168 entries).

Table 1: Description of datasets used for exposition and testing of our method.


Almost all of these papers use fast, linear (or $O(N \log N)$) algorithms, based on the box-counting method.
We also use a similar approach for our tri-plots.

Visualization  techniques  for  large  amounts  of  multidimensional  data  have  also  been  developed.  The  work 
described  in  [KK94]  presents  a  visualization  method  which  utilizes  views  of  the  data  around  reference  points 
and  effectively  reduces  the  amount  of  information  to  be  displayed  in  a  way  that  affects  various  characteristics 
of the data (e.g., shape and location of clusters) in a controlled manner.

There  has  also  been  significant  work  on  data  mining  in  non-spatial,  multidimensional  databases.  Recent 
work  on  a  general  framework  that  incorporates  a  number  of  algorithms  is  presented  in  [iHLN99].  The  authors 
introduce  a  general  query  language  and  demonstrate  its  application  on  the  discovery  of  a  large  variety  of 
association  rules  which  satisfy  the  anti-monotonicity  property. 

However,  none  of  the  above  methods  can  answer  all  the  questions,  Q1  to  Q5,  which  we  posed  in  the 
previous section. The method proposed in this paper can answer such questions. To find a solution for the
given problem, we move away from association rules and focus on the spatial relationships between two
multidimensional datasets.

2.1  Description  of  the  data  sets 

We applied our method on several datasets, both synthetic and real. The former are used to build intuition,
and the latter to validate our techniques. The synthetic datasets are always normalized to a unit hypercube
and they may be translated, rotated and/or scaled in the experiments. The datasets are described in table 1.
Symbol                   Definition

$N_A$ (or $N_B$)         No. of points in dataset A (or B)
$Cross$                  $Cross_{A,B}(r, 1, 1)$ plot between datasets A and B
$Self_A$                 $Self_A(r, 1, 1)$ plot of dataset A
$W_A$                    $Cross_{A,B}(r, 10, 1)$ cross-cloud plot weighted on dataset A
$W_B$                    $Cross_{A,B}(r, 1, 10)$ cross-cloud plot weighted on dataset B
$C_{A,i}$ ($C_{B,i}$)    Count of type A (B) points in the i-th cell
$n$                      No. of dimensions (embedding dimensionality)
$D_2$                    Correlation fractal dimension
$r_{min}$                Est. minimum distance between two points
$r_{max}$                Est. maximum distance between two points

Table 2: Symbols and definitions


3  Proposed  method:  cross-cloud  plots 


Our approach relies on a novel method that allows fast summarization of the distribution of distances between
points from two sets A and B. Table 2 presents the symbols used in this paper. Consider a grid with cells of
side r and let $C_{A,i}$ ($C_{B,i}$) be the number of points from set A (B) in the i-th cell. The cell grid partitions the
minimum bounding box of both datasets. The cross-function $Cross_{A,B}(r, p, q)$ is defined as follows:

Definition 1 Given two data sets A and B in the same n-dimensional space, we define the cross-function of
order (p, q) as

$$Cross_{A,B}(r, p, q) = \sum_i C_{A,i}^p \cdot C_{B,i}^q \qquad (1)$$
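As an illustration of Definition 1, the following minimal Python sketch evaluates the cross-function for a single grid resolution by counting points per cell. The function names (`cell_counts`, `cross_function`), the use of NumPy, and the choice of anchoring the grid at the lower corner of the joint bounding box are assumptions made for this example; the text does not prescribe them.

```python
from collections import Counter

import numpy as np


def cell_counts(points, r, origin):
    """Count points per grid cell of side r (cells are keyed by integer tuples)."""
    idx = np.floor((np.asarray(points, dtype=float) - origin) / r).astype(int)
    return Counter(map(tuple, idx))


def cross_function(A, B, r, p=1, q=1):
    """Sum over grid cells of C_A^p * C_B^q, for a grid with cells of side r."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Assumed convention: the grid starts at the lower corner of the joint bounding box.
    origin = np.minimum(A.min(axis=0), B.min(axis=0))
    ca = cell_counts(A, r, origin)
    cb = cell_counts(B, r, origin)
    # For p, q > 0 only cells occupied by both sets contribute a non-zero term.
    return sum(ca[c] ** p * cb[c] ** q for c in ca.keys() & cb.keys())
```

Calling `cross_function(A, B, r)` for a sequence of radii and plotting the logarithm of the result against `log(r)` gives the raw (unnormalized) curve for p = q = 1.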

Typically, we plot the cross-function in log-log scales, after some normalization. The normalization
factor scales the plot, maximizing the information presented:

Definition 2 Given two data sets A and B (with $N_A$ and $N_B$ points) in the same n-dimensional space, we
define the cross-cloud plot as the plot of

$$Cross_{A,B}(r, p, q) = \frac{\log(N_A \cdot N_B)}{\log(N_A^p \cdot N_B^q)} \cdot \log\left(\sum_i C_{A,i}^p \cdot C_{B,i}^q\right) \quad \text{versus} \quad \log(r)$$


The cross-function has several desirable properties:

Property 1 For p = q = 1, the cross-function is proportional to the count of A-B pairs within distance r.
That is,

$$Cross_{A,B}(r, 1, 1) \propto (\text{\# of pairs of points within distance} < r)$$


Proof Using Schuster's lemma [Sch88].

This is an important property. For p = q = 1, the cross-cloud plot gives the cumulative distribution
function of the pairwise distances between the two "clouds" A and B [FSJT00]. Because of its importance,
we will use p = q = 1 as the default values. We will also omit the subscripts A, B from the cross-cloud plot
when it is clear which datasets are involved. That is,

$$Cross(r) = Cross_{A,B}(r) = Cross_{A,B}(r, 1, 1)$$


Property 2 The cross-function includes the correlation integral as a special case when we apply it to the
same dataset (i.e., A = B).

Proof From the definition of correlation integral [Sch91].

The correlation integral gives the correlation fractal dimension $D_2$ of a dataset A, if it is self-similar.
Since the above property is very important, we shall give the self cross-cloud plots a special name:

Definition 3 The self-plot of a given dataset A is the plot of

$$Self_A(r) = \log\left(\frac{\sum_i C_{A,i} \cdot (C_{A,i} - 1)}{2}\right) \quad \text{versus} \quad \log(r)$$

In order to avoid artifacts that self-pairs generate, self-plots do not count self-pairs, by definition. Moreover,
mirror pairs ($(p_1, p_2)$ and $(p_2, p_1)$) are counted only once.
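The self-plot value for one radius can be sketched in the same box-counting style: each cell holding C points contributes C(C - 1)/2 unordered pairs, which excludes self-pairs and counts mirror pairs once. The function name `self_plot_value`, the natural logarithm, and the grid anchored at the dataset's own bounding-box corner are assumptions of this example.

```python
import math
from collections import Counter

import numpy as np


def self_plot_value(A, r):
    """Log of the number of unordered same-cell pairs for a grid of side r."""
    A = np.asarray(A, dtype=float)
    # Assumed convention: the grid starts at the lower corner of A's bounding box.
    idx = np.floor((A - A.min(axis=0)) / r).astype(int)
    counts = Counter(map(tuple, idx))
    # C * (C - 1) / 2 pairs per cell: no self-pairs, mirror pairs counted once.
    pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return math.log(pairs) if pairs > 0 else float("-inf")
```

Radii for which no cell holds two or more points yield `-inf`, which corresponds to the part of the plot left of the minimum-distance estimate.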

Property 3 If A is self similar, then the self-plot of A is linear and its slope is its intrinsic dimensionality
(correlation fractal dimension, $D_2$).

Proof See [BF95].

We are now ready to define our two main tools, the tri-plot and the pq-plot.

Definition 4 The tri-plot of two datasets, A and B, is the graph which contains the cross-plot $Cross(r)$ and
the normalized self-plots for each dataset ($Self_A(r) + \log(N_A/N_B)$ and $Self_B(r) + \log(N_B/N_A)$).

The normalization factors, $\log(N_A/N_B)$ and $\log(N_B/N_A)$, perform only translation, preserving the
steepness of the graphs. In this paper, for every tri-plot we present the three graphs with the same color pattern:
the cross-plot is presented in blue lines with diamonds, $Self_A$ in green lines with crosses and $Self_B$ in red
lines with squares. We also show the slope (or steepness) of the fitted lines.

Definition 5 The pq-plot of two datasets, A and B, is the graph of the three cross-cloud plots: $Cross_{A,B}(r)$,
$Cross_{A,B}(r, 1, k)$, and $Cross_{A,B}(r, k, 1)$ for large values of k ($k \gg 1$).

Fig. 1 shows the tri-plot and pq-plot for the Line and Sierpinsky datasets. Notice that, although the
$Cross()$ is almost always linear (fig. 1a), this is not necessarily true for the $Cross(r, 1, k)$ and $Cross(r, k, 1)$
(in fig. 1b, k = 10).

Definition  6  The  steepness  of  a  plot  is  its  slope,  as  determined  by  fitting  a  line  with  least-squares  regression. 
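A minimal sketch of the steepness computation, assuming the plot is available as two arrays of log-radii and log-values. It uses NumPy's `np.polyfit` for the least-squares line; the function name `steepness` and the decision to drop non-finite values (radii with zero counts) are assumptions of this example. In practice one would presumably also restrict the fit to the range of radii where the plot has not yet turned flat.

```python
import numpy as np


def steepness(log_r, log_values):
    """Slope of the least-squares line fitted to (log_r, log_values)."""
    log_r = np.asarray(log_r, dtype=float)
    log_values = np.asarray(log_values, dtype=float)
    keep = np.isfinite(log_values)  # skip radii whose count is zero (log = -inf)
    slope, _intercept = np.polyfit(log_r[keep], log_values[keep], 1)
    return float(slope)
```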

The tri-plots allow us to determine the relationship between the two datasets. If they are self-similar (i.e., both
their self-plots are linear for a meaningful range of radii), their slopes can be used in the comparisons that
follow. However, the proposed analysis can be applied even to datasets which are not self-similar (i.e., do
not have linear self-plots). Thus, we will use the terms steepness and similarity (as defined above). The
pq-plot is used in a further analysis step. Its use is more subtle and is discussed in section 4.3.
[Figure 1, panel (a) title: Tri-Plot: Sierpinsky15K X Line10K; horizontal axis: log(radii)]

Figure 1: Sierpinsky and Line datasets: (a) the tri-plot, (b) the pq-plot. The cross-plots are presented in blue
with diamonds, the self- and weighted-Sierpinsky plots in green with crosses, and the self- and weighted-Line
in red with squares.

3.1  Anatomy  of  the  proposed  plots 

This  section  shows  how  to  “read”  the  cross-cloud  plots  and  take  advantage  of  the  tri-  and  pq-plots,  without 
any  extra  calculations  on  the  datasets. 

3.1.1  Properties  of  the  self-plots 

Property 4 The first radius for which the count-of-pairs is not zero in the self-plot provides an accurate
estimate, $r_{min}$, of the minimum distance between any two points.

Property 5 Similarly, the radius up to which the count-of-pairs increases (being constant for larger radii)
provides an accurate estimate, $r_{max}$, of the maximum distance between any two points. We also refer to this
distance as the dataset diameter.

Fig. 2 illustrates the above properties. The lower row of fig. 2a shows a line with 15,000 points. Its self-plot
is linear. The slope, which is $D_2$, is equal to 1, as expected (since this is the intrinsic dimensionality of a
line). The $r_{min}$ and $r_{max}$ estimates are also indicated.
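Properties 4 and 5 translate into a simple scan over the pair counts. The sketch below assumes the counts have already been computed for an ascending list of radii, that they are non-decreasing (as with nested grids), and that the largest radius covers the whole dataset; the function name `estimate_rmin_rmax` is hypothetical.

```python
def estimate_rmin_rmax(radii, pair_counts):
    """Estimate (r_min, r_max) from self-plot pair counts.

    radii must be sorted ascending; pair_counts[i] is the pair count at radii[i].
    """
    # Property 4: first radius with a non-zero count of pairs.
    r_min = next((r for r, c in zip(radii, pair_counts) if c > 0), None)
    # Property 5: first radius at which the count reaches its final, constant value.
    total = pair_counts[-1]
    r_max = next((r for r, c in zip(radii, pair_counts) if c == total), None)
    return r_min, r_max
```

The resolution of both estimates is limited by the spacing of the radii that were evaluated.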

Property 6 If the dataset consists of clusters, the self-plot has a plateau from radius $r_{min}$ to $r_{max}$ (see fig. 2).

Whenever the self-plot is piecewise linear, the dataset has characteristic scales. Plateaus are of particular
interest; these occur when the dataset is not homogeneous. From the endpoints of the plateau, we can
accurately estimate the maximum cluster diameter, $r_{cdmax}$, and characteristic separation between clusters,
$r_{sepc}$. This occurs in the self-plot of the Super-cluster dataset (fig. 2b).
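A plateau can be located by looking for a run of radii over which the pair count stays constant while being non-zero and below its final value. The following sketch makes that idea concrete; the name `find_plateaus` and the use of exact equality are assumptions for illustration (with real data a tolerance on the change in counts would likely be needed).

```python
def find_plateaus(radii, pair_counts):
    """Return (start_radius, end_radius) for each flat run of the self-plot.

    A run is flat when consecutive counts are equal, non-zero, and smaller
    than the final count (so the terminal flat part is not reported).
    """
    total = pair_counts[-1]
    plateaus, start = [], None
    for i in range(1, len(radii)):
        flat = pair_counts[i] == pair_counts[i - 1] and 0 < pair_counts[i] < total
        if flat and start is None:
            start = i - 1  # plateau begins at the previous radius
        elif not flat and start is not None:
            plateaus.append((radii[start], radii[i - 1]))
            start = None
    return plateaus
```

Under this reading, the start of a plateau approximates the maximum cluster diameter and its end approximates the characteristic separation between clusters.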

3.1.2  Properties  of  the  cross-cloud  plot 

Fig. 3 presents an example of a tri-plot, where dataset A is a randomly generated set of 6,000 points from
a line (y = xO/x, $y \in [0, 1]$), and dataset B is a Sierpinsky triangle with 6,561 points. These two datasets
were chosen to highlight some interesting plot properties. These are discussed in the following (see also
fig. 3).

Property  7  The  minimum  distance  between  the  datasets  can  be  accurately  estimated  as  the  smallest  radius 
which  has  a  non-zero  value  in  the  cross-cloud  function. 
Figure 2: Measurements obtained from self-plots: (a) Line, and (b) Super-cluster datasets.

Property  8  Similarly,  the  maximum  distance  between  the  datasets  (or,  the  maximum  surrounding  diameter) 
can  be  accurately  estimated  as  the  greatest  radius  before  the  plot  turns  flat. 

Property  9  Whenever  the  cross-cloud  plot  has  a  flat  part  for  very  small  radii,  there  are  duplicate  points 
across  both  datasets. 

All  the  previous  estimates  can  be  obtained  with  a  single  processing  pass  over  both  datasets  to  count  grid 
occupancies,  without  explicitly  computing  any  distances. 
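The single-pass claim can be illustrated as follows: every point is read once and, for each grid resolution, the counter of the cell it falls into is incremented. The sketch assumes the grid origin (for example the corner of the bounding box, or 0 for data normalized to the unit hypercube) is known in advance; the names `single_pass_counts` and `cross_counts` are hypothetical.

```python
from collections import Counter

import numpy as np


def single_pass_counts(points_a, points_b, radii, origin):
    """Read each labeled point once and update the cell counters of every grid."""
    origin = np.asarray(origin, dtype=float)
    grids = [{"A": Counter(), "B": Counter()} for _ in radii]
    labeled = [("A", x) for x in points_a] + [("B", x) for x in points_b]
    for label, x in labeled:  # single pass over the data
        x = np.asarray(x, dtype=float)
        for grid, r in zip(grids, radii):
            cell = tuple(np.floor((x - origin) / r).astype(int))
            grid[label][cell] += 1
    return grids


def cross_counts(grids):
    """Cross-function (p = q = 1) for every grid, from the stored occupancies."""
    return [
        sum(g["A"][c] * g["B"][c] for c in g["A"].keys() & g["B"].keys())
        for g in grids
    ]
```

No pairwise distance is computed at any point; the smallest radius with a non-zero entry in `cross_counts(grids)` gives the estimate described in Property 7.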

Property  10  The  steepness  of  the  cross-cloud  plot  is  always  greater  than  or  equal  to  that  of  the  steepest 
self-plot. 

4  Practical  usage  -  Cloud  mining 

Before  presenting  our  main  analysis  process,  we  need  to  define  some  terms: 

Definition 7 The shape of a dataset refers to its formation law (e.g., "line," "square," "sierpinsky").
Definition 8 Two datasets are collocated if they have (highly) overlapping minimum bounding boxes.
Definition 9 The placement of a dataset refers to its position and orientation.

We use these three terms when comparing two datasets. Two datasets can have the same shape but different
placement (e.g., two non-collinear lines). Two datasets have the same shape but different placement if
one can be obtained from the other through affine transformations. Also, two datasets with the same intrinsic
dimensionality can have different shapes (e.g., a line and a circle; both have $D_2 = 1$).
Figure 3: Example of a tri-plot indicating where to find meaningful information. The cross-plot is always in
blue with diamonds, $Self_A$ in green with crosses and $Self_B$ in red with squares.

4.1  Rules  for  tri-plot  analysis 

In  this  section  we  present  rules  (see  table  3  for  a  summary)  to  analyze  and  classify  the  relationship  between 
two  datasets.  From  the  tri-plots  we  can  get  information  about  the  intrinsic  structure  and  the  global  relationship 
between  the  datasets. 

Rule 1 (identical) If both datasets are identical, then all plots of a tri-plot are similar ($Self_A \approx Self_B \approx
Cross$). In this case, the three graphs will be on top of each other. This means that the intrinsic dimensionality,
shape as well as placement of both datasets are the same. This may be because one dataset is a subset of the
other, or both are samples from a bigger one. Fig. 4 shows the tri-plots for (a) two lines with different number
of objects, (b) two Sierpinsky triangles, and (c) two coplanar squares in 3D. All datasets in fig. 4 are in a 2D
manifold. In all these examples, both datasets have the same shape and placement but different number of
points.

Rule 2 (same shape, different placement) If both datasets have the same intrinsic dimensionality, but
different placement, then their steepness is similar ($Self_A \approx Self_B$), but $Cross$ is only moderately steeper
than both. Further analysis using the pq-plot can indicate whether the datasets are separable or not and, if
separable, to what extent. Examples are intersecting lines, intersecting planes, or two Sierpinsky datasets
with one rotated over the other (see fig. 5a, 5b and 5c, respectively).
Rule  Situation                              Condition                                   Example

A and B are similar ($Self_A$ and $Self_B$ have same steepness), and
1     Datasets A and B are                   $Cross$, $Self_A$ and $Self_B$ have the     Figure 4
      statistically identical                same steepness
2     Both datasets have the same            $Cross$ has steepness comparable to         Figure 5
      intrinsic dimensionality               that of $Self_A$ and $Self_B$
3     The datasets are disjoint              $Cross$ is much steeper than both           Figure 6
                                             $Self_A$ and $Self_B$

A and B are not similar ($Self_A$ and $Self_B$ have different steepness), and
4     The (less steep) dataset is a          $Cross$ and $Self_A$ or $Self_B$ have       Figure 7
      proper subset of the other             the same steepness
5     The datasets are collocated            $Cross$ has steepness comparable to         Figure 8
                                             that of $Self_A$ and $Self_B$
3     The datasets are disjoint              $Cross$ is much steeper than both           Figure 6
                                             $Self_A$ and $Self_B$

Table 3: Conditions and rules used in tri-plot analysis.


Rule 3 (disjoint datasets) If the datasets are disjoint, then $Cross$ is much steeper than both $Self_A$ and
$Self_B$ (it does not matter whether the latter are similar or not). For two intersecting datasets, the $Cross$
steepness will not be far from the steepness of their self-plots. However, if the $Cross$ is much steeper than both
$Self_A$ and $Self_B$, it means that the minimum distance between points from the datasets is bigger than the
average distance of the nearest neighbors of points in both datasets, so the datasets are disjoint. In fact, this
case leads to the conclusion that both datasets are well-defined clusters, hence they should be separable by
traditional clustering techniques. Examples of this situation are non-intersecting lines, squares far apart, or
a Sierpinsky triangle and a plane which is not coplanar with the Sierpinsky's supporting plane (see fig. 6a
to 6c). All datasets are in 3D space. Notice that the self-plots have the expected slopes, but the cross-plots
have very high steepness (18, 13 and 26 respectively).


[Figure 4: tri-plot panels (a) to (c); legible panel title: Sierpinsky10K X Sierpinsky9K]

Figure 4: Rule 1 - The two datasets have the same shape and placement: (a) Two superimposed lines (all
plots have slope 1), (b) Two superimposed Sierpinsky triangles (all plots have slopes $\approx 1.64 \approx \log 3 / \log 2$),
(c) Two superimposed squares (all plots have slopes 2). All datasets are in 2D space, and the axes of all
tri-plots are in log-log scale.


[Figure 5: tri-plot panels (a) to (c); horizontal axis: log(radii)]

Figure  5:  Rule  2  -  The  two  datasets  have  the  same  intrinsic  dimensionality,  but  different  placements:  (a)  Two 
intersecting  circumferences  in  2D  space,  (b)  A  line  crossing  a  circumference  in  2D  space,  (c)  Two  piercing 
planes  in  3D  space.  The  upper  row  shows  the  tri-plots  with  the  axes  in  log-log  scale.  The  lower  row  shows 
the  corresponding  datasets  in  their  respective  spaces. 


Rule 4 (sub-manifold) Without loss of generality, let $Self_A$ be the steepest of $Self_A$ and $Self_B$. If dataset
B is a sub-manifold of dataset A, the self-plots do not have similar steepness ($Self_A \not\approx Self_B$) and the $Cross$
is equal to $Self_A$. Remember that the steepness of the $Cross$ cannot be smaller than the steepness of $Self_A$ or
$Self_B$. Therefore, if the steepness of the $Cross$ is similar to one of the self steepnesses (e.g., $Cross \approx Self_A$),
then the other graph (in this case $Self_B$) will be less steep than $Cross$. This means that the points in dataset
B have a stronger correlation than the points in dataset A. Rule 1 deals with the situation where both datasets
are subsets of a larger one, or one is a subset of another, but there is no rule to extract the subsets. Rule 4
deals with the same case of occurrence of subsets, but here there are rules to choose points that pertain to the
dataset with a smoother self-plot. Examples of this case are a line embedded in a plane, a Sierpinsky dataset
and its supporting plane, and a square embedded in a volume (see fig. 7a, 7b and 7c, respectively).

Rule 5 (collocated) If both datasets have different shape, placement and intrinsic dimensionality, then
$Self_A \not\approx Self_B$ and the $Cross$ is only moderately steeper than $Self_A$ and $Self_B$. In this case, the datasets are
not related to each other. They are, however, collocated, or at least intersecting. This means that although part
of the datasets may be separable, this would not be true for the entire dataset, or for both datasets. Whenever
this situation occurs, it should be further analyzed, for example, using the pq-plot. These are the cases of
a line with a Sierpinsky triangle, a line piercing a square, and a Sierpinsky intersecting a square, as fig. 8
shows.
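The five rules can be read as a small decision procedure over the three steepness values. The sketch below is one possible encoding: the text only uses qualitative terms ("similar", "moderately steeper", "much steeper"), so the relative tolerance `tol` and the factor `far_factor` are hypothetical thresholds chosen for illustration, as is the function name `classify_relationship`.

```python
def classify_relationship(self_a, self_b, cross, tol=0.15, far_factor=3.0):
    """Return the number (1 to 5) of the tri-plot rule suggested by three steepness values.

    tol and far_factor are illustrative thresholds, not values from the text.
    """
    def similar(x, y):
        return abs(x - y) <= tol * max(abs(x), abs(y))

    steepest = max(self_a, self_b)
    if cross > far_factor * steepest:
        return 3  # Cross much steeper than both self-plots: disjoint
    if similar(self_a, self_b):
        # Similar self-plots: identical (Rule 1) or same dimensionality only (Rule 2).
        return 1 if similar(cross, steepest) else 2
    # Dissimilar self-plots: sub-manifold (Rule 4) or merely collocated (Rule 5).
    return 4 if similar(cross, steepest) else 5
```

For instance, the steepness values reported in the legend of fig. 6b (about 2.0, 2.0 and 13.3) fall under Rule 3 with these thresholds. Cases classified as Rule 2 or Rule 5 are the ones the text sends on to the pq-plot for further analysis.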

4.2  Application  to  real  datasets 

In  the  previous  section  we  described  the  rules,  using  synthetic  datasets  to  build  intuition.  Here  we  apply  them 
to  real  datasets  (see  fig.  9). 
[Figure 6 panels and legends: (a) Line10K X Line10K: Cross stp=8.4201, Self Line10 stp=0.9999, Self Line10 stp=0.9999; (b) Square10K (A) X Square10K (B): Cross stp=13.2877, Self Square10K(A) stp=1.9971, Self Square10K(B) stp=1.9997; (c) Sierpinsky X Square: Cross stp=26.8154, Self Sierpinsky stp=1.5939, Self Square6K stp=2.0081]

Figure 6: Rule 3 - The two datasets are disjoint: (a) two non-intersecting lines, (b) two non-intersecting
squares, (c) a square and a Sierpinsky triangle. The upper row shows the tri-plots with the axes in log-log
scale. The lower row shows the corresponding datasets in 3D space.


Rule 1 (identical) There are four pairs of datasets which conform to this rule: two different subsets of
California-political (fig. 9a), the two galaxy datasets (for $\log r \in [-4, 4]$, fig. 9b), Iris-versicolor and
Iris-virginica (fig. 9c), and two different subsets of California-water (fig. 9d).

Rule 3 (disjoint datasets) The Iris-Versicolor and Iris-Setosa pair (fig. 9e), and the Democrat and
Republican pair (fig. 9f) conform to this rule. Their cross-plot is much steeper than their self-plots. Versicolor and
Setosa species are indeed apart. Also, the Democrat and Republican parties have distinct behavior, which
allows separation of their members. Thus, we conclude that these dataset pairs can be separated and we can
estimate the minimum distance between them (see property 7).

Rule 4 (sub-manifold) Fig. 9g shows the tri-plot of California-water and California-political. Recall that
the dataset with smaller steepness is probably a proper sub-manifold of the one with larger steepness (or
of the superset from which both are samples). We thus conclude that California-political is a subset of
California-water. This makes sense, since many political divisions are along water paths.

Rule 5 (collocated) According to fig. 9h, California-railroad and California-political agree with Rule 5.
This is reasonable, since railroads are built with objectives irrelevant to political divisions. Also, the LC
datasets agree with Rule 5 and require further analysis. The flat parts in fig. 9i and in the political self-plot
(fig. 9h) indicate that these datasets possibly have duplicate (or near-duplicate) points. The Galaxy datasets
(fig. 9b) demonstrate the case of clusters, which are present at two characteristic distances. Also, the datasets
repel each other for radii close to the cluster diameter. After analyzing the relationship between two datasets
using tri-plots, more information can be obtained from the pq-plots.
Figure 7: Rule 4 - One dataset is a proper subset of the other dataset: (a) a square overlapping a line in
2D space, (b) a Sierpinsky triangle and its supporting plane in 2D space, (c) a volume traversed by a plane
in 3D space. The upper row shows the tri-plots in log-log scale. The lower row shows the datasets in their
respective spaces.

4.3 Analysis of the pq-plot

The  pq-plot  allows  us  to  further  examine  the  relationship  between  two  datasets,  by  weighting  one  dataset 
when  comparing  its  distance  distribution  with  that  of  the  other  dataset.  The  analysis  of  the  pq-plots  is  directed 
to  specific  ranges  of  the  cross-cloud  plots,  in  contrast  to  the  more  global  analysis  of  the  tri-plots. 

Even if a $Cross_{A,B}(r, p, q)$ plot with $p \neq 1$ or $q \neq 1$ happens to be a line, its slope has no meaning; only its
overall shape has useful properties. Also, due to the normalization by $\log(N_A \cdot N_B)/\log(N_A^p \cdot N_B^q)$, both
the leftmost and rightmost points in all pq-plots coincide. According to equation 1, if a particular $C_{A,i}$ (or
$C_{B,i}$) in the calculation of $Cross_{A,B}(r, p, q)$ is zero for a given radius r in a given region of the space, the
corresponding $C_{B,i}$ (or $C_{A,i}$) will not contribute to the total for this particular radius. The result will be a flat
region in this part of the curve. Otherwise, if there is a regular distribution of distances over a continuous
part of the curve, the resulting curve will exhibit a linear shape. Sudden rises in a plot indicate a large growth
of counts starting at that radius. Hence, the two shapes in the curves of the cross-cloud plots that are worth
looking for are: the linear parts, and the regions where the curves are flat.

The cross-cloud plots $Cross_{A,B}(r, k, 1)$ and $Cross_{A,B}(r, 1, k)$ with $k \gg 1$ (which we have named $W_A$
and $W_B$ because they are 'weighted') can be generated for any value of $k$. However, increasing $k$ only
increases the distortions on the plot, without giving any extra information. Thus, we picked $k = 10$. Each
conclusion is valid only for the range of radii that exhibits the corresponding behavior. Next, we discuss two
representative situations, using pairs of synthetic datasets and comparing the obtained tri-plots and pq-plots.
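
The weighted plots can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the form of the weighted cross function follows the normalization factor quoted above and our reading of equation 1 (which is defined outside this section), and the names `cell_counts`, `cross_pq` and `pq_plot` are hypothetical helpers, not part of the original work.

```python
import math
from collections import Counter

def cell_counts(points, r):
    """Occupancy count of each non-empty grid cell of side r (points assumed in [0, 1)^n)."""
    return Counter(tuple(int(x / r) for x in p) for p in points)

def cross_pq(counts_a, counts_b, n_a, n_b, p, q):
    """Weighted cross value for one grid size (assumed form of equation 1).

    Cells that are empty in either dataset contribute nothing to the sum.
    Assumes n_a * n_b > 1 so that the normalization is defined.
    """
    shared = counts_a.keys() & counts_b.keys()
    total = sum(counts_a[c] ** p * counts_b[c] ** q for c in shared)
    if total == 0:
        return None  # no cell is occupied by both datasets at this radius
    norm = math.log(n_a * n_b) / math.log(n_a ** p * n_b ** q)
    return norm * math.log(total)

def pq_plot(points_a, points_b, radii, k=10):
    """Rows of (r, Cross, W_A, W_B), with W_A = Cross(r, k, 1) and W_B = Cross(r, 1, k)."""
    n_a, n_b = len(points_a), len(points_b)
    rows = []
    for r in radii:
        ca, cb = cell_counts(points_a, r), cell_counts(points_b, r)
        rows.append((r,
                     cross_pq(ca, cb, n_a, n_b, 1, 1),
                     cross_pq(ca, cb, n_a, n_b, k, 1),
                     cross_pq(ca, cb, n_a, n_b, 1, k)))
    return rows
```

With $p = q = 1$ the normalization factor equals 1 and the function reduces to the plain Cross value. When a single cell covers the whole space, the sum equals $N_A^p \cdot N_B^q$, so all three curves meet at $\log(N_A \cdot N_B)$ at the largest radius.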

Fig. 10 compares two pairs of datasets: circumference-circumference and line-circumference. This
illustrates the situation stated by Rule 2: the two datasets are similar ($Self_A \approx Self_B$, and the Cross steepness
is less than or equal to the steepness of $Self_A$ plus the steepness of $Self_B$). By looking only at the tri-plots
in fig. 10a and 10d, it is not possible to say anything else about the datasets. However, in fig. 10b the three
graphs are on top of each other. This means that both datasets have the same behavior under weighted
calculation ($Cross(r, 1, 10)$ and $Cross(r, 10, 1)$). Thus, both datasets have the same shape. On the other hand,
the behavior of the pq-plot in fig. 10e shows that the datasets have different shapes, as well as how they are
correlated within specific radii ranges (Regions I and II on the plots).

[Figure 8, panel (c) legend: Sierpinsky x Square10K; Cross stp=3.0049; Self Sierpinsky stp=1.6708;
Self Square10K stp=2.0004]

Figure 8: Rule 5 - The datasets come from different placements: (a) a line and a Sierpinsky triangle in 2D
space, (b) a line piercing a square in 3D space, (c) a plane and an intersecting Sierpinsky triangle in 3D space.
The upper row shows the tri-plots in log-log scale. The lower row shows the corresponding datasets in their
respective spaces.

In this section we proposed the rules to analyze the tri-plots and the pq-plots using easily understandable
synthetic datasets in 2D and 3D spaces. However, the same conclusions should apply for real datasets in
any multi-dimensional space. In fact, for real datasets it is usually difficult to know how to describe the
relationship between the attributes and to know if they are correlated. Nonetheless, our proposed analysis
can indicate not only the existence of correlations, but also how "tight" they are. This analysis can also
provide evidence of how separable the datasets are, as well as if it is possible to classify points as belonging
to one or to the other dataset.

4.4 Using pq-plots to analyze datasets

Due to space limitations, we present pq-plots only for some of the real datasets (fig. 11). Fig. 11a shows the
pq-plot for the Galaxy datasets. For the highlighted range, there is a distinct separation between the datasets.
Besides confirming that the two galaxy types indeed repel each other, the pq-plots show that there are few
clusters consisting only of 'exp' galaxies (although there are clusters including points of both datasets, and
also clusters with only 'dev' points). Outside the highlighted range, the sets are almost identical. As expected,
fig. 11b confirms that the Democrat and Republican datasets are separable, since the weighted plots have
completely opposite behaviors.

Fig. 11c shows the pq-plot of the California-water and California-political datasets. In this plot, there
are four ranges with distinct behaviors. Range I corresponds to very small distances, so these distances are
probably less than the resolution of the measurements; therefore they are not meaningful. Ranges II and III
are where the real distances are meaningful. The sudden fall to the left of the $W_{Water}$-plot in range II means
that there are very few points in the political dataset at distances below this range from points in the water
[Figure 9 panel title: LC: 19 attrib classA x classB]


Figure  9:  Tri-plots  of  real  datasets  and  their  classification  as  obtained  from  rules  1-5. 


dataset.  This  indicates  a  kind  of  “repulsion”  of  points  from  both  datasets  for  these  small  distances.  In  range 
III,  both  datasets  have  approximately  the  same  behavior.  Range  IV  is  almost  flat  for  all  plots,  meaning  that 
there  are  almost  no  more  pairs  within  this  distance  range.  In  fact,  the  “almost  flat”  part  of  the  graph  is  due  to 
a  few  outliers  in  the  dataset. 

4.5  Membership  testing  and  classification 

So far we have shown how to use the tri-plots to answer questions Q1-Q5. In this section we illustrate the
power  of  cross-cloud  plots  in  another  setting:  membership  testing  and  classification  (Q5).  Fig.  12  illustrates 
the  following  situation:  We  have  two  datasets,  A  (20  points  along  a  line)  and  B  (900  points  in  a  ‘tight’ 
square).  A  new  point  (indicated  by  ‘?’)  arrives.  Which  set,  if  any,  does  it  belong  to? 

Visually, the new point ('?') should belong to the Line20 set. However, nearest neighbors or decision-tree
classifiers  would  put  it  into  the  square:  the  new  point  has  ~  900  ‘Square’  neighbors,  before  even  the  first 
‘Line20’  neighbor  comes  along! 

We  propose  a  method  that  exploits  cross-cloud  plots  to  correctly  classify  the  new  point  (‘?’).  The  new 
point  is  treated  as  a  singleton  dataset  and  its  cross-plots  are  compared  to  the  self-plots  of  each  candidate  set. 
In this particular case, we compare the steepness of $Cross_{Line,Point}$ and $Cross_{Square,Point}$ to the steepness
of $Self_{Line}$ and $Self_{Square}$ and classify the new point accordingly. Notice that the plots in fig. 12b are more
similar to each other (almost equal steepness), while the plots in fig. 12c are clearly not similar. Thus, we
conclude that the new point ('?') belongs to the Line20 dataset, despite what k-nearest neighbor classification
would say!
[Figure 10 panel titles: "Tri-plot: two intersecting Circumferences"; "pq-plot: two intersecting Circumferences"]


Figure 10: pq-plots for two pairs of datasets: (a) the tri-plot of two intersecting circumferences (as shown
in  (c)),  (b)  the  pq-plot  of  the  two  circumferences,  (d)  the  tri-plot  of  a  line  intersecting  a  circumference  (as 
shown  in  (f)),  and  (e)  the  pq-plot  of  the  line  and  the  circumference. 


The full details of the classification method are the topic of ongoing research. This is yet another
application of the cross-cloud technique.
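
Since the full method is left open above, the following Python fragment is only one plausible reading of the steepness comparison, not the authors' procedure. It assumes the log-log points of each plot are already available; `steepness` and `classify_by_steepness` are hypothetical names, and the numbers in the usage lines are invented for illustration rather than taken from fig. 12.

```python
def steepness(log_r, log_count):
    """Least-squares slope of a plot given its log-log coordinates."""
    n = len(log_r)
    mean_x = sum(log_r) / n
    mean_y = sum(log_count) / n
    sxx = sum((x - mean_x) ** 2 for x in log_r)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_count))
    return sxy / sxx

def classify_by_steepness(self_stp, cross_stp):
    """Return the candidate set whose Cross(set, point) steepness is closest to its Self steepness.

    self_stp:  {set name: steepness of the Self plot of that set}
    cross_stp: {set name: steepness of the Cross plot of that set with the new point}
    """
    return min(self_stp, key=lambda name: abs(self_stp[name] - cross_stp[name]))

# Illustrative (made-up) steepness values for the two candidate sets.
self_stp = {"Line20": 1.0, "Square": 2.0}
cross_stp = {"Line20": 0.9, "Square": 0.3}
print(classify_by_steepness(self_stp, cross_stp))  # Line20
```

The absolute difference of slopes is the simplest similarity measure; comparing the plots over a restricted range of radii would be an equally reasonable choice.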

5  Implementation 

To obtain the required tri-plots, we use the single-pass algorithm presented in appendix A. This is based on
box-counting and is an extension of [BF95, FSJT00].

Importantly, this algorithm scales to arbitrarily large datasets and arbitrarily high dimensions.
This is rarely true for other spatial data mining methods in the literature. The algorithm to generate the
pq-plots is very similar to the algorithm in appendix A, except that we construct the $W_A$ and $W_B$ plots
(instead of the $Self_A$ and $Self_B$ plots).

5.1  Scalability 

The algorithm is linear on the total number of points, i.e. $O(N_A + N_B)$. If we want $l$ points in each
cross-cloud plot (i.e. the number of grid sizes), then the complexity of our algorithm is $O((N_A + N_B) \cdot l \cdot n)$, where
$n$ is the embedding dimensionality. Fig. 13 shows the wall-clock time required to process datasets on a
Pentium II machine running NT4.0. The datasets on the left graph have varying numbers of points in 2, 8
and 16-dimensional spaces, and we used 20 grids for each dataset. For the right graph, we used datasets with
100,000, 200,000 and 300,000 points and dimensions 2 to 40. The execution time is indeed linear on the
total number of points, as well as on the dimensionality of the datasets. The algorithm does not suffer from
the dimensionality curse.

Notice  that  steps  1  and  2  of  the  algorithm  read  the  datasets  and  maintain  counts  of  each  non-empty  grid 
cell.  These  counts  can  be  kept  in  any  data  structure  (hash  tables,  quadtrees,  etc). 
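
As a small illustration of why sparse storage matters, the Python fragment below keeps the counts in a hash table keyed by the cell's integer coordinates. It is a sketch, not the original implementation: the helper name `occupied_cells`, the random test data and the grid size are assumptions made for the example.

```python
import random
from collections import Counter

def occupied_cells(points, r):
    """Map each non-empty grid cell (a tuple of integer indices) to its point count."""
    return Counter(tuple(int(x / r) for x in p) for p in points)

random.seed(0)  # fixed seed so the sketch is reproducible
points = [tuple(random.random() for _ in range(16)) for _ in range(1000)]
cells = occupied_cells(points, 0.25)

print(4 ** 16)     # cells in the full 16-D grid: 4294967296
print(len(cells))  # cells actually stored: never more than the 1000 points
```

Only cells that receive at least one point ever get an entry, so the table size is bounded by the number of points rather than by the size of the grid.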
Figure 11: pq-plots for real datasets: (a) Galaxy, (b) Democrat and Republican, (c) California-water and
California-political. The upper row shows the tri-plots and the lower row the corresponding pq-plots. The
axes are in log-log scales.

6  Conclusions 

We  have  proposed  the  cross-cloud  plot,  a  new  tool  for  spatial  data  mining  across  two  n-dimensional  datasets. 
We  have  shown  that  our  tool  has  all  the  necessary  properties: 

•  It  can  spot  whether  two  clouds  are  disjoint  (separable),  statistically  identical,  repelling,  or  in-between. 
That  is,  it  can  answer  questions  Q1  to  Q4  from  section  1. 

•  It can be used for classification and is capable of "learning" a shape/cloud, where traditional classifiers
fail to do so (i.e. it can answer question Q5).

•  It is very fast and scalable: We use a box-counting algorithm, which requires a single pass over each
dataset, and the memory requirement is proportional to the number $F$ of non-empty grid cells and to
the number $l$ of grid sizes requested ($1 \le F \le N_A + N_B$, and clearly not exploding exponentially).


Figure 12: Classifying a point as either belonging to a sparse line or to a dense square, using the cross-cloud
method: (a) spatial placement of the incoming point and the datasets, (b) $Self_{Line}$ and $Cross_{Line,Point}$ plots,
(c) $Self_{Square}$ and $Cross_{Square,Point}$ plots.
Figure 13: Left - Wall-clock time (in seconds) needed to generate the tri-plots for varyingly sized datasets.
The blue graph represents the time for 2D datasets, the green graph for 8D datasets and the red graph for 16D
datasets. Right - Wall-clock time (in seconds) needed to generate the tri-plots versus the dimensionality of
the datasets, for three different dataset sizes (100,000, 200,000 and 300,000).


•  Tri-plots  can  be  applied  to  high-dimensional  datasets  easily,  because  the  algorithms  scale  linearly  with 
the  number  of  dimensions. 

The experiments on real datasets show that our tool finds patterns that no other known method can. We
believe that our cross-cloud plot is a powerful tool for spatial data mining and that we have seen only the
beginning of its potential uses.
Algorithm: Fast tri-plot

Inputs: Two datasets, A and B (with $N_A$ and $N_B$ points respectively) normalized to
        the unit hyper-cube, and the number $l$ of desired points in each plot.

Output: Tri-plot

Begin

1 - For each point a of dataset A:
      For each grid size $r = 1/2^j$, $j = 1, 2, \ldots, l$:
        Decide which grid cell it falls in (say, the i-th cell)
        Increment the count $C_{A,i}$

2 - For each point b of dataset B:
      For each grid size $r = 1/2^j$, $j = 1, 2, \ldots, l$:
        Decide which grid cell it falls in (say, the i-th cell)
        Increment the count $C_{B,i}$

3 - Compute the sum of product occupancies for the functions:
      $Self_A(r) = \log\left(\sum_i C_{A,i} \cdot (C_{A,i} - 1)\right)$,
      $Self_B(r) = \log\left(\sum_i C_{B,i} \cdot (C_{B,i} - 1)\right)$,
      $Cross_{A,B}(r) = \log\left(\sum_i C_{A,i} \cdot C_{B,i}\right)$

4 - Print the tri-plot:
      For $r = 1/2^j$, $j = 1, 2, \ldots, l$:
        Print $Cross_{A,B}(r)$
        Print $Self_A$ normalized: $Self_A(r) + \log(N_B/N_A)$
        Print $Self_B$ normalized: $Self_B(r) + \log(N_A/N_B)$

End


Figure  14:  Algorithm 
A Algorithm

Given two datasets A and B (with cardinalities $N_A$ and $N_B$) in an n-dimensional space, we generate the
tri-plot (i.e. the $Cross_{A,B}$, $Self_A$ and $Self_B$ plots) using the algorithm shown in Figure 14. Note that the number $F$
of non-empty cells in each grid does not depend on the dimensionality $n$. In fact, $1 \le F \le N_A + N_B$.
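
The steps of Figure 14 can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: it assumes the points are tuples of floats already normalized to $[0, 1)^n$, uses natural logarithms, reports `None` where a sum is zero (the logarithm is undefined there), and the function names are hypothetical.

```python
import math
from collections import Counter

def count_cells(points, l):
    """Steps 1 and 2: one pass over a dataset, updating the cell counts of all l grids."""
    counts = [Counter() for _ in range(l)]
    for p in points:
        for j in range(1, l + 1):
            side = 2 ** j  # grid size r = 1/2^j, so the cell index is floor(x * 2^j)
            counts[j - 1][tuple(int(x * side) for x in p)] += 1
    return counts

def log_or_none(total):
    """Logarithm of a sum of products, or None when the sum is zero."""
    return math.log(total) if total > 0 else None

def shift(value, offset):
    """Add the normalization offset unless the value is undefined."""
    return None if value is None else value + offset

def tri_plot(points_a, points_b, l):
    """Steps 3 and 4: rows of (log r, Cross, Self_A normalized, Self_B normalized)."""
    n_a, n_b = len(points_a), len(points_b)
    counts_a, counts_b = count_cells(points_a, l), count_cells(points_b, l)
    rows = []
    for j in range(1, l + 1):
        c_a, c_b = counts_a[j - 1], counts_b[j - 1]
        self_a = log_or_none(sum(c * (c - 1) for c in c_a.values()))
        self_b = log_or_none(sum(c * (c - 1) for c in c_b.values()))
        cross = log_or_none(sum(c * c_b[cell] for cell, c in c_a.items() if cell in c_b))
        rows.append((math.log(1.0 / 2 ** j),
                     cross,
                     shift(self_a, math.log(n_b / n_a)),
                     shift(self_b, math.log(n_a / n_b))))
    return rows
```

Each dataset is read once, and the work per point is proportional to $l \cdot n$, which matches the stated complexity. The counters hold only non-empty cells.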